train: add simple loading already tokenized data from parquet dataset #14522
base: master
Conversation
Branch updated from 1bb0911 to 2574024.
@JohannesGaessler what about my changes? :)
Sorry for the late reply. Generally speaking I would greatly prefer it if the training data were to be stored as GGUF files. That will make my life as a maintainer much easier since I won't have to deal with external dependencies. How about this: come up with a standardized way to define training data as GGUF, write code for constructing an […]
Preferably use a prefix for the metadata and tensors. Looking at […]
We will use the training. prefix for all keys to avoid conflicts with model metadata.
Tensors
[…]
I don't think you need an array with the sequence lengths per tensor since you can just query the shape of a tensor. I think it's enough to store the maximum sequence length (you could also get this by iterating over the tensors). Consider that people may also want to store untokenized datasets; I would suggest using uint8 + metadata for the encoding in those cases (it's fine if this use case is not implemented in this PR).
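A minimal sketch of what writing such a dataset could look like with ggml's gguf API, assuming one int32 tensor per tokenized sequence; the key and tensor names under the training. prefix (training.type, training.sequence.count, training.sequence.max_length, training.tensor.<i>) are illustrative placeholders, not a finalized spec:

```cpp
// Sketch only: one possible writer for a tokenized GGUF dataset.
// All "training.*" key/tensor names are placeholders, not an agreed-upon spec.
#include "ggml.h"
#include "gguf.h"

#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

static void write_dataset_gguf(const std::vector<std::vector<int32_t>> & sequences, const char * fname) {
    struct gguf_context * gctx = gguf_init_empty();

    // metadata under the "training." prefix to avoid clashes with model keys
    gguf_set_val_str(gctx, "training.type",           "tokenized");
    gguf_set_val_u32(gctx, "training.sequence.count", (uint32_t) sequences.size());

    uint32_t max_len = 0;
    for (const auto & seq : sequences) {
        max_len = std::max<uint32_t>(max_len, (uint32_t) seq.size());
    }
    gguf_set_val_u32(gctx, "training.sequence.max_length", max_len);

    // one 1D int32 tensor per sequence; the per-sequence length is implied by the tensor shape
    struct ggml_init_params iparams = {
        /*.mem_size   =*/ ggml_tensor_overhead() * (sequences.size() + 1),
        /*.mem_buffer =*/ nullptr,
        /*.no_alloc   =*/ true, // token data stays in the vectors, ggml only holds the tensor structs
    };
    struct ggml_context * ctx = ggml_init(iparams);

    for (size_t i = 0; i < sequences.size(); i++) {
        struct ggml_tensor * t = ggml_new_tensor_1d(ctx, GGML_TYPE_I32, (int64_t) sequences[i].size());
        ggml_set_name(t, ("training.tensor." + std::to_string(i)).c_str());
        t->data = (void *) sequences[i].data();
        gguf_add_tensor(gctx, t);
    }

    gguf_write_to_file(gctx, fname, /*only_meta =*/ false);

    ggml_free(ctx);
    gguf_free(gctx);
}
```

With this layout a loader gets each sequence's length from the tensor shape and only needs training.sequence.max_length from the metadata, as suggested above.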
A rough ("dirty") implementation of a converter to the new format: https://github.com/lexasub/llama.cpp/tree/finetune-backup :)
@JohannesGaessler first, I added GGUF dataset support in #14622.
We also need to add streaming/batching, but that is a more complex task :)
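A rough idea of what streaming could look like (again only a sketch, reusing the same placeholder names): open the file with no_alloc so that only the metadata is parsed, then read individual sequences from the data section on demand:

```cpp
// Sketch only: stream one sequence at a time instead of loading the whole dataset.
#include "ggml.h"
#include "gguf.h"

#include <cstdint>
#include <cstdio>
#include <vector>

static std::vector<int32_t> load_sequence(const char * fname, int64_t idx) {
    struct ggml_context * ctx_meta = nullptr;

    struct gguf_init_params params = {
        /*.no_alloc =*/ true, // parse metadata only, leave tensor data on disk
        /*.ctx      =*/ &ctx_meta,
    };
    struct gguf_context * gctx = gguf_init_from_file(fname, params);

    // the tensor's shape gives the sequence length, its offset gives the position of the tokens
    struct ggml_tensor * t = ggml_get_tensor(ctx_meta, gguf_get_tensor_name(gctx, idx));
    const size_t offs      = gguf_get_data_offset(gctx) + gguf_get_tensor_offset(gctx, idx);
    const size_t nbytes    = ggml_nbytes(t);

    std::vector<int32_t> tokens(nbytes / sizeof(int32_t));

    FILE * f = fopen(fname, "rb");
    fseek(f, (long) offs, SEEK_SET);
    const size_t n_read = fread(tokens.data(), 1, nbytes, f);
    fclose(f);
    GGML_ASSERT(n_read == nbytes);

    ggml_free(ctx_meta);
    gguf_free(gctx);
    return tokens;
}
```

Batching on top of that would then mostly be a matter of grouping sequence indices and reading them in one pass.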